Expose cuda device health status in /healthz endpoint #1056

Merged 7 commits on Dec 4, 2024

Conversation

@papa99do (Collaborator) commented Nov 28, 2024

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
    Operational improvement

  • What is the current behavior? (You can also link to an open issue here)
    A CUDA device can fail silently, which causes Marqo to behave unexpectedly.

  • What is the new behavior (if this is a feature change)?
    Expose a /healthz endpoint that reports CUDA device status. The endpoint returns a 500 error when:

      • the CUDA device becomes unavailable
      • the CUDA device is out of memory

    Any scheduling framework can call this endpoint as a liveness check for the Marqo container.
  • Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
    No

  • Have unit tests been run against this PR? (Has there also been any additional testing?)
    Yes

  • Related Python client changes (link commit/PR here)
    No

  • Related documentation changes (link commit/PR here)
    N/A

  • Other information:

  • Please check if the PR fulfills these requirements

  • The commit message follows our guidelines
  • Tests for the changes have been added (for bug fixes/features)
  • Docs have been added / updated (for bug fixes / features)

@papa99do changed the title from "yihan/cuda-issue-mitigation" to "Support cuda device health check in k8s liveness check" on Nov 28, 2024
@papa99do changed the title from "Support cuda device health check in k8s liveness check" to "Expose cuda device health status in /healthz endpoint" on Nov 28, 2024
@papa99do marked this pull request as ready for review on November 28, 2024 05:04
@papa99do force-pushed the yihan/cuda-issue-mitigation branch 2 times, most recently from d7c5d1c to ae4278e, on December 3, 2024 05:46
@papa99do merged commit dfc1ed8 into mainline on Dec 4, 2024
9 checks passed
@papa99do deleted the yihan/cuda-issue-mitigation branch on December 4, 2024 21:03
3 participants